v1.0: rewrite parsing backend on pypdfium2 #124

Open
codereverser wants to merge 5 commits into main from feature/v1.0

Conversation

@codereverser
Owner

@codereverser codereverser commented May 18, 2026

Summary

v1.0 is a full rewrite of the CAS parsing layer:

  • Backend: pdfminer.six + PyMuPDF → pypdfium2 (Apache-2.0 / BSD-3). casparser is now pure MIT end-to-end — no transitive GPL/AGPL obligations.
  • Per-issuer parsers: CAMS/KFin DETAILED, CAMS/KFin SUMMARY, NSDL, and CDSL each have a dedicated parser tuned to its template family, replacing the single regex-on-text pipeline.
  • Drops the mupdf / fast extras and the --force-pdfminer CLI flag (force_pdfminer= kwarg kept as a no-op with DeprecationWarning).
  • Minimum Python: 3.11.

Why

  • v0.8 had three NSDL/CDSL regex bugs corrupting holdings (misplaced-UCC-as-folio, space-merged folio+units, dropped NSDL HDFC subaccount on CDSL multi-account statements), plus a MutualFund.fix_float validator miss on Optional[Decimal] aliased fields.
  • New CAMS/KFin 2026 templates broke the existing SUMMARY and MF-Holdings regexes (12 folios returning 0; zero-balance schemes dropped).
  • PyMuPDF's GPL/AGPL footprint complicated downstream packaging.

What's in the box

  • New parsers under casparser/parsers/:
    • pageobj.py — shared page-object atom extractor (NSDL/CDSL)
    • extract.py — char/line extractor (CAMS/KFin)
    • cams_detailed.py, cams_summary.py, nsdl.py, cdsl.py
    • detect.py — file-type sniffer; wraps PdfiumError into CASParseError / IncorrectPasswordError
    • _classify.py, _isin.py, _investor.py — shared helpers
  • All v0.8 fields populated: investor_info, folio.PAN/KYC/PANKYC, scheme.isin/amfi/type, scheme.nominees, scheme.valuation.cost.
  • investor_info is now required on both CASData and NSDLCASData (matches the contract of "every CAS contains an investor block").
  • New "Fund House" AMC suffix recognised (Zerodha).
  • CDSL multi-account statements (PAYTM + NEXTBILLION + FINWIZARD + HDFC + ZERODHA on one PDF) parse correctly; DIRECT (non-ARN) distribution-mode rows populate PnL/return.
  • ISIN/AMFI enrichment via casparser-isin with a direct-ISIN fallback path for templates where multi-line registrar rendering mangles the RTA token.

Bug fixes that landed alongside the rewrite

  • CAMS SUMMARY valuation.date was mis-parsing to date(201, 1, 1) — column boundary + Pydantic coercion fix.
  • KFin SUMMARY 2026 zero-balance schemes (HMTOGT, HPREG) no longer dropped.
  • CAMS SUMMARY 2026 (with new ISIN column) parses again.

Breaking changes

  • casparser.types.CASData.investor_info: Optional[InvestorInfo] → InvestorInfo (parser raises CASParseError if it can't find the block).
  • casparser.types.NSDLCASData.investor_info: same change.
  • casparser.types.NSDLCASData.file_type: Optional[FileType] = None → FileType.
  • ProcessedCASData and PartialCASData removed from casparser.types (they were internal to the old pipeline).
  • casparser.process package removed; surviving helpers moved to casparser.parsers._classify and casparser.parsers._isin.
  • --force-pdfminer / force_pdfminer= is a no-op (emits DeprecationWarning).
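
For callers that still pass `force_pdfminer=`, a no-op shim of the kind described in the last bullet might look like the sketch below. The function name `read_cas_pdf_compat` and its return value are hypothetical stand-ins, not casparser's actual implementation; only the `force_pdfminer` keyword and the `DeprecationWarning` come from this PR.

```python
import warnings


def read_cas_pdf_compat(filename, password, force_pdfminer=None):
    """Hypothetical reader that still accepts the retired keyword."""
    if force_pdfminer is not None:
        # The argument no longer selects a backend: warn, then ignore it.
        warnings.warn(
            "force_pdfminer is deprecated and has no effect",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
    # A real reader would parse the PDF here; the sketch returns a marker.
    return {"file": filename, "backend": "pypdfium2"}
```

Existing call sites keep working, and the warning surfaces under `python -W error::DeprecationWarning` or in test runs that escalate warnings.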

Testing

  • 24/24 unit + integration tests pass (tests/test_pypdfium.py, tests/test_helpers.py, tests/test_gains.py, tests/casparser/test_cli.py).
  • 13/13 production samples (CAMS × 5, KFin × 5, NSDL × 1, CDSL × 2) parse cleanly end-to-end with populated investor_info, ISIN/AMFI/type, PAN/KYC, valuation.cost, nominees.

Test plan

  • CI green on Python 3.11 / 3.12 / 3.13
  • casparser CLI on a sample CAMS/KFin/NSDL/CDSL PDF
  • pip install -U casparser (or uv sync) installs without pulling pdfminer.six / PyMuPDF
  • CHANGELOG.md and README.md changes reflect the new state

@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 96.70901% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.04%. Comparing base (0029577) to head (09773ac).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| casparser/parsers/nsdl.py | 97.23% | 11 Missing ⚠️ |
| casparser/parsers/pageobj.py | 95.13% | 10 Missing ⚠️ |
| casparser/parsers/cams_detailed.py | 96.88% | 8 Missing ⚠️ |
| casparser/parsers/extract.py | 95.21% | 8 Missing ⚠️ |
| casparser/parsers/cams_summary.py | 96.45% | 7 Missing ⚠️ |
| casparser/parsers/cdsl.py | 98.88% | 3 Missing ⚠️ |
| casparser/types.py | 86.96% | 3 Missing ⚠️ |
| casparser/parsers/__init__.py | 95.84% | 2 Missing ⚠️ |
| casparser/parsers/_investor.py | 96.93% | 2 Missing ⚠️ |
| casparser/parsers/_isin.py | 88.89% | 2 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #124      +/-   ##
==========================================
+ Coverage   88.66%   97.04%   +8.39%     
==========================================
  Files          18       19       +1     
  Lines        1463     2295     +832     
==========================================
+ Hits         1297     2227     +930     
+ Misses        166       68      -98     


Rebuilds the parsing layer for v1.0 on top of pypdfium2 (Apache-2.0 /
BSD-3) so casparser ships pure MIT end-to-end; the prior pdfminer.six
+ PyMuPDF dependencies are dropped along with the entire `casparser/
process/` regex-tokenisation pipeline they fed.

Engine (casparser/parsers/extract.py, pageobj.py)
=================================================

`extract.py` walks PDF page objects (one atom per text-show op),
maps glyphs to their parent atom via PDFium's
`FPDFText_GetTextObject`, deduplicates same-font overlapping atoms,
then emits `Char`/`Line`/`Page` shaped output that downstream
parsers consume. Atom-level dedup replaces all per-character
overlay heuristics: when two atoms share a font, x-overlap by
>=50% of the narrower atom's width, and sit 0.05-3.0pt apart in
y, we drop the one further from the row's median baseline. That
handles the date-twin artefact (same date column rendered twice
with a small y-offset, glyphs interleaving by x to produce garbage
like `2020 -> 22002200`) without the multi-stage sub-cluster
filters earlier prototypes used.
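
The dedup rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `extract.py`: the `Atom` fields, the helper names, and the choice to compute the median baseline over the atoms of one already-grouped row are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Atom:
    """Hypothetical stand-in for one text-show op on a page."""
    text: str
    font: str
    x0: float
    x1: float
    y: float  # baseline


def _is_twin(a: Atom, b: Atom) -> bool:
    """Same font, x-overlap >= 50% of the narrower atom, 0.05-3.0pt apart in y."""
    if a.font != b.font:
        return False
    overlap = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if narrower <= 0 or overlap < 0.5 * narrower:
        return False
    return 0.05 <= abs(a.y - b.y) <= 3.0


def dedup_row(atoms: list[Atom]) -> list[Atom]:
    """From each twin pair, drop the atom further from the row's median baseline."""
    if not atoms:
        return []
    base = median(a.y for a in atoms)
    dropped: set[int] = set()
    for i, a in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            if i in dropped or j in dropped:
                continue
            if _is_twin(a, atoms[j]):
                # Ties keep the earlier atom (an arbitrary choice in this sketch).
                far = i if abs(a.y - base) > abs(atoms[j].y - base) else j
                dropped.add(far)
    return [a for k, a in enumerate(atoms) if k not in dropped]
```

Because whole atoms are dropped before any glyphs are merged, the interleaved `22002200` string never forms in the first place.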

`pageobj.py` exposes the atoms + their column/block grouping that
the NSDL/CDSL parsers operate on directly. The same Atom primitive
backs the investor extractor.

Per-issuer parsers
==================

- `cams_detailed.py` / `cams_summary.py` consume the Line stream
  for CAMS + KFin DETAILED and SUMMARY templates.
- `nsdl.py` reads the page-2 account roster, walks per-account
  holdings sections (equities + mutual funds + corporate bonds in
  both summary and detailed forms). Section-aware routing handles
  the case where multiple holding types share the same 18-cell
  detailed table header by tracking `cur_section` from the
  preceding marker block. The page-2 roster accepts both the
  4-cell (broker + DP/Client joined) and 5-cell (broker, then
  DP/Client) variants.
- `cdsl.py` mirrors NSDL's structure for the CDSL CAS template.
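
The section-aware routing described for `nsdl.py` can be pictured as a small state machine. The sketch below is hypothetical throughout: the `Block` shape, the marker strings, and the header test (`cells[0] == "ISIN"`) are assumptions chosen for illustration; only the idea of tracking `cur_section` from the preceding marker block comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one grouped block of cells on a page."""
    cells: list[str]


# Hypothetical marker text -> section name; real statements may differ.
SECTION_MARKERS = {
    "Equities": "equities",
    "Mutual Funds": "mutual_funds",
    "Corporate Bonds": "bonds",
}


def route_rows(blocks: list[Block]) -> dict[str, list[list[str]]]:
    """Assign each data row to the section named by the preceding marker block."""
    out: dict[str, list[list[str]]] = {name: [] for name in SECTION_MARKERS.values()}
    cur_section = None
    for block in blocks:
        if len(block.cells) == 1 and block.cells[0] in SECTION_MARKERS:
            cur_section = SECTION_MARKERS[block.cells[0]]
            continue
        if block.cells and block.cells[0] == "ISIN":
            # Shared table header: identical across sections, so it cannot
            # tell us which holding type follows. Skip it.
            continue
        if cur_section is not None:
            out[cur_section].append(block.cells)
    return out
```

The header alone is ambiguous, so the only reliable signal is the marker that came before it, which is why the state lives outside the row loop body.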

Types
=====

- Adds `Bond` model with optional coupon_rate / coupon_frequency /
  maturity_date / face_value / market_price; required fields are
  isin, num_bonds, value. Surfaces on `DematAccount.bonds`.
- `investor_info` is now required on `CASData` and `NSDLCASData`.
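
As a rough picture of the `Bond` shape described above, here is a stdlib-only stand-in. The real model is part of casparser's Pydantic types; the concrete field types below (`Decimal`, `date`, `str`) are assumptions, and only the field names and the required/optional split come from the commit message.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class Bond:
    """Stdlib stand-in for the Bond model; field types are assumptions."""
    # Required fields
    isin: str
    num_bonds: Decimal
    value: Decimal
    # Optional fields, absent on some statement forms
    coupon_rate: Optional[Decimal] = None
    coupon_frequency: Optional[str] = None
    maturity_date: Optional[date] = None
    face_value: Optional[Decimal] = None
    market_price: Optional[Decimal] = None
```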

Performance
===========

The dispatcher opens the PDF document exactly once per
`read_cas_pdf` call and threads the handle through detect /
parser / investor extractor via an `_doc=` kwarg. NSDL/CDSL
additionally share the extracted atoms between the holdings
parser and the investor extractor.
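
The open-once pattern can be sketched as follows. Everything here is a hypothetical stand-in: `FakeDoc` takes the place of the real PDF handle (presumably a `pypdfium2.PdfDocument`), and the stage functions return placeholders. Only the `_doc=` keyword and the detect / parser / investor-extractor split come from the commit message.

```python
class FakeDoc:
    """Hypothetical stand-in for an opened PDF handle; counts how often it is opened."""
    opened = 0

    def __init__(self, path):
        FakeDoc.opened += 1
        self.path = path


def _resolve(path, _doc):
    # Reuse the caller's handle when given; open a new one only as a fallback.
    return _doc if _doc is not None else FakeDoc(path)


def detect(path, *, _doc=None):
    _resolve(path, _doc)
    return "NSDL"  # placeholder result


def parse_holdings(path, *, _doc=None):
    doc = _resolve(path, _doc)
    return {"source": doc.path}  # placeholder result


def extract_investor(path, *, _doc=None):
    doc = _resolve(path, _doc)
    return {"source": doc.path}  # placeholder result


def read_cas(path):
    """Dispatcher: open once, then thread the handle through every stage."""
    doc = FakeDoc(path)
    return {
        "file_type": detect(path, _doc=doc),
        "holdings": parse_holdings(path, _doc=doc),
        "investor_info": extract_investor(path, _doc=doc),
    }
```

Keeping `_doc` optional means each stage still works when called on its own, at the cost of one extra open per standalone call.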

Replaces the v0.8 pdfminer / PyMuPDF test files with a per-issuer
e2e suite plus a focused unit-test layer.

Layout
======

- `tests/conftest.py` — module-scoped fixtures for each fixture PDF
  (CAMS / KFin / NSDL / CDSL detailed + summary). Each fixture
  loader skips its dependent tests when the corresponding env var
  isn't set, so contributors without the encrypted bundle can still
  run the unit-test portion.
- `tests/_assertions.py` — invariant helpers shared across the e2e
  suite. Designed to lock in correctness without encoding the real
  rupee figures from private fixtures.
- `tests/test_cams.py`, `test_kfin.py`, `test_nsdl.py`,
  `test_cdsl.py` — per-issuer e2e tests.
- `tests/test_errors.py` — error-path + back-compat shim tests.
- `tests/test_demat_units.py` — NSDL/CDSL parser unit tests using
  synthetic Block/Cell fixtures (no real ISINs, names, or IDs).
- `tests/test_helpers.py`, `tests/test_gains.py`,
  `tests/test_gains_e2e.py` — existing helper / gains coverage,
  retained.
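
The env-var gate behind the fixture loaders could be as small as the helper below. It is a sketch, not the contents of `tests/conftest.py`: the function name and the example variable name are hypothetical, and the injectable `environ` argument exists only to make the helper easy to test.

```python
import os
from pathlib import Path


def fixture_path(env_var, environ=None):
    """Return the fixture path named by env_var, or None when it is unset or not a file."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var)
    if not raw:
        return None
    path = Path(raw)
    return path if path.is_file() else None
```

A module-scoped fixture would then call `pytest.skip(...)` whenever `fixture_path("CAS_FIXTURE_CAMS_DETAILED")` (a made-up variable name) returns `None`, so only the tests that depend on that PDF are skipped.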

Arithmetic invariants
=====================

The e2e tests verify parsing correctness without depending on
specific rupee amounts:

- **CAMS / KFin DETAILED**:
  scheme.close * scheme.valuation.nav == scheme.valuation.value
  and scheme.open + sum(txn.units) == scheme.close.
- **NSDL / CDSL**:
  sum(eq.value + mf.value + bd.value) == account.balance per
  account; mf.balance * mf.nav == mf.value;
  bond.num_bonds * bond.face_value == bond.value (summary form);
  bond.num_bonds * bond.market_price == bond.value (detailed
  form).

These catch column-swap, decimal-parse, anchor-drift, and missed-
transaction bugs without encoding portfolio totals in the repo.
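
A minimal sketch of the CAMS / KFin DETAILED invariants, assuming simplified stand-in types rather than the real casparser models. The rounding tolerance `tol` is an assumption added here, since printed values are rounded; the commit message states the checks as exact equalities.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Txn:
    units: Decimal


@dataclass
class Scheme:
    """Hypothetical stand-in carrying only the fields the invariants touch."""
    open: Decimal
    close: Decimal
    nav: Decimal
    value: Decimal
    transactions: list[Txn] = field(default_factory=list)


def check_scheme(s: Scheme, tol: Decimal = Decimal("0.01")) -> list[str]:
    """Return the violated invariants; an empty list means the scheme is consistent."""
    problems = []
    if abs(s.close * s.nav - s.value) > tol:
        problems.append("close * nav != value")
    total_units = sum((t.units for t in s.transactions), Decimal(0))
    if abs(s.open + total_units - s.close) > tol:
        problems.append("open + sum(txn.units) != close")
    return problems
```

Neither check mentions a concrete amount, so the helpers can sit in a public repo while the fixtures stay private.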

Removed
=======

- `tests/test_pdfminer.py`, `tests/test_mupdf.py`,
  `tests/test_process.py` — backend-specific suites for the
  v0.8 stack.
- `tests/test_pypdfium.py`, `tests/base.py` — the intermediate
  single-file test suite is superseded by the per-issuer split.
- `tests/pytest.ini` — empty file masked the pyproject.toml
  pytest config.

- `pyproject.toml`: bumps version to 1.0.0, drops pdfminer.six
  (AGPL-3.0+) and PyMuPDF (GPL-3.0+ / commercial) from runtime
  deps, replaces with pypdfium2 (Apache-2.0 / BSD-3). Loosens
  remaining version bounds where compatible (click <10, rich <16,
  pypdfium2 <7, pydantic <3, etc.) and refreshes the dev-group
  upper bounds (pytest <10, pytest-cov <8, ipython, coverage).
- Python floor lifts to 3.11 (3.10 reaches end-of-life in October 2026).
- `uv.lock`: regenerated against the new dep set.
- `.github/workflows/run-pytest.yml`: switches CI to Python 3.12,
  decrypts `tests/files.enc` for the encrypted fixture bundle, and
  exposes the per-fixture env-var matrix to pytest. The PyPI publish
  workflow no longer references the removed backends.
- `licenses/AGPL-3.0+.txt` + `licenses/GPL-3.0+.txt`: removed —
  no longer required to redistribute since the GPL/AGPL deps are
  gone.
- `README.md`: documents the v1.0 backend swap and refreshes
  external links. `CHANGELOG.md` gets a 1.0.0 section.

v0.9.0 shipped a PyMuPDF-1.25 compatibility fix on top of v0.8's
existing backend; v1.0 has already replaced that backend with
pypdfium2, so the v0.9 parser patches don't apply. The merge keeps
v1.0's parser layer and folds in the v0.9 metadata changes that
are still relevant:

- `casparser-isin>=2026.5.1` (DB format v2 with sebi_category /
  last_seen / ISIN-first lookup priority) — adopted.
- `pdfminer.six` and the `mupdf`/`fast` PyMuPDF extras stay
  removed (1.0.0's pure-pypdfium2 stack).
- `MutualFund.fix_float` aliased-field bug fix (v0.9 patched it
  on the v0.8 model; v1.0's model already carries the same fix).
- CI matrix: adopt v0.9's `[3.11, 3.12, 3.13]` Python matrix; drop
  `--all-extras` from `uv sync` (no extras to install any more).
- CHANGELOG keeps the 1.0.0 entry on top and a condensed 0.9.0
  entry below it for historical record.

Files v0.9 modified that v1.0 had already deleted are kept deleted:
- casparser/parsers/mupdf.py
- casparser/process/{__init__,cas_detailed,cas_summary,cdsl_statement,
  nsdl_statement,regex,utils}.py

Tests: 151/151 with private fixtures, 87/87 + 64 skipped without.

GitHub deprecation notice — actions running on Node.js 20 will be
forced to Node.js 24 from 2026-06-02. Bump every referenced action
to its current Node-24-native major:

  actions/checkout            v4 → v6
  actions/setup-python        v5 → v6
  astral-sh/setup-uv          v5 → v8
  codecov/codecov-action      v5 → v6

These are all backward-compatible at the workflow-input level; no
input changes required.
@codereverser codereverser changed the title from "v1.0: rewrite parsing backend on pypdfium2; full feature parity with v0.8" to "v1.0: rewrite parsing backend on pypdfium2" on May 22, 2026